Constraint-free Graphical Model with Fast Learning Algorithm
Abstract



In this paper, we propose a simple, versatile model for learning the structure and parameters of multivariate distributions from a data set. Learning a Markov network from a given data set is not a simple problem, because Markov networks rigorously represent Markov properties, and this rigor imposes complex constraints on the design of the networks. Our proposed model removes these constraints, drawing on important aspects of information geometry. The proposed parameter- and structure-learning algorithms are simple to execute, as they are based solely on local computation at each node. Experiments demonstrate that our algorithms work appropriately.

1 INTRODUCTION

The purpose of this paper is to propose a versatile mechanism for learning multivariate distributions from sets of data. This learning mechanism is based on a simple parameter-learning algorithm and a simple structure-learning algorithm.

Markov networks are versatile tools for modeling multivariate probability distributions, because they do not require any assumptions except for the Markov property of the problem. In this paper, we treat finite discrete systems; thus all variables take finite discrete values.

However, appropriately learning a Markov network from a given data set is not simple (Koller and Friedman, 2009). Let us give a geometrical view of this problem. Let $Y_i$ be the neighbors of $X_i$, and $X_{-i}$ be the variables in the network except for $X_i$. Given a graph $G$, the Markov network that has this graph represents a manifold of distributions:

$$\mathcal{M}_G = \{\, p \mid \forall i,\ p(X_i \mid X_{-i}) = p(X_i \mid Y_i) \,\} \tag{1}$$

The Hammersley-Clifford theorem (Besag, 1974) proves that this manifold is identical to:

$$\mathcal{M}_G = \Big\{\, p \;\Big|\; p(X) = \frac{1}{Z} \prod_c \phi_c(X_c) \,\Big\} \tag{2}$$

where $c$ ranges over the cliques in $G$, $X_c$ denotes the variables in $c$, $\phi_c$ is the potential of clique $c$, and $Z$ denotes the normalizing constant.

Figure 1: Learning a Markov Network



Given an empirical distribution $\pi$, the role of structure-learning is to determine a manifold $\mathcal{M}_G$, and the role of parameter-learning is to determine a distribution $\pi' \in \mathcal{M}_G$. If we consider maximizing likelihood (i.e., minimizing Kullback-Leibler divergence), then structure-learning algorithms should place $\mathcal{M}_G$ close to $\pi$, and parameter-learning algorithms should place $\pi'$ at:

$$\pi' = \arg\min_{p \in \mathcal{M}_G} KL(\pi \| p). \tag{3}$$

The difficulties that arise in designing learning algorithms are:

• $\pi'$ and $KL(\pi \| \pi')$ have no closed form.

• The problem of obtaining $\pi'$ is not decomposable; in other words, we cannot obtain each $\phi_c$ independently.

• It is intractable to find $\pi'$ in cases where the number of variables is large.



To avoid these difficulties, various approaches have been attempted (see Chapter 20.9 in Koller and Friedman, 2009). In this paper, we take a new approach that avoids these difficulties. First, we propose a new network system, named a firing process network. This is not a conventional graphical model; it is obtained by relaxing the constraints of Markov networks.

In Section 2, we formulate the firing process network that the proposed learning algorithms work on, and illustrate its information-geometric aspects. In Section 3, we introduce the parameter-learning and structure-learning algorithms, as well as aspects of their information geometry. We also present some information criteria that keep the structure-learning algorithm from overfitting. In Section 4, we show how to draw samples from the model distribution, and also show that this model is able to draw samples from posterior distributions. Section 5 provides experimental demonstrations that the proposed model works appropriately.

2 FIRING PROCESS NETWORK 

2.1 NODE 




Figure 2: Node i 

The firing process network consists of $n$ nodes. Herein, these nodes are indexed by the numbers $0, \dots, n-1$. Each node has a variable $X_i$ and a conditional probability table $\theta_i(X_i \mid Y_i)$. Node $i$ references other nodes in the network, denoted by $Y_i$, which we call the information source of node $i$.

We now assume that $Y_i = y_i$. "Firing node $i$" means drawing a sample from the distribution $\theta_i(X_i \mid y_i)$ (see footnote 1) and assigning it to the value of $X_i$.
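As a minimal sketch of what firing a node involves, the following Python snippet stores a conditional probability table as a dictionary and samples a new value for $X_i$ given the current values of its information source. The class name `Node`, the dictionary layout, and the example probabilities are illustrative assumptions, not part of the original formulation.

```python
import random


class Node:
    """One node of a firing process network (illustrative sketch)."""

    def __init__(self, values, sources, cpt):
        self.values = values      # possible values of X_i
        self.sources = sources    # indices of the information source Y_i
        self.cpt = cpt            # maps a tuple y_i to probabilities over `values`

    def fire(self, state, rng):
        """Draw x_i from theta_i(X_i | y_i) given the current joint state."""
        y = tuple(state[j] for j in self.sources)
        return rng.choices(self.values, weights=self.cpt[y], k=1)[0]


rng = random.Random(0)
# Hypothetical table: node 0 copies node 1 with probability 0.9.
node0 = Node(values=[0, 1], sources=[1],
             cpt={(0,): [0.9, 0.1], (1,): [0.1, 0.9]})
state = [0, 1]
draws = [node0.fire(state, rng) for _ in range(1000)]
```

With node 1 held at the value 1, roughly nine out of ten draws should equal 1.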

2.2 FORMULATION OF A FIRING 
PROCESS NETWORK 

The firing process network is formulated as follows.

$FPN = FPN(G, \theta, f_p)$: Firing process network.
$G$: Directed graph.
$X = \{X_i\}$: Nodes in $G$.
$Y_i$: Information source of $X_i$, i.e., nodes that have edges to $X_i$.
$\eta_i$: Set of node numbers included in $Y_i$.
$\theta = \{\theta_i\}$: Parameters.
$\theta_i$: Conditional probability table $\theta_i(X_i \mid Y_i)$.
$W_i$: Linear operator (= matrix) that moves a distribution $p(X)$ to the distribution $p(X_{-i})\,\theta_i(X_i \mid Y_i)$, i.e.

$$(pW_i)(X) = p(X_{-i})\,\theta_i(X_i \mid Y_i). \tag{4}$$

This operator represents the transition matrix caused by firing node $i$.

1 In this paper, variables are denoted by capitals, $X, Y, Z$, and their values are denoted by lower case, $x, y, z$. We also use shortened forms, such as $p(X \mid y)$ $(= p(X \mid Y = y))$.

$f_p$: Either a sequential firing process or a random firing process. The firing processes are Markov chains. At each time $t$, one node is chosen and fired. As in Gibbs sampling (Gilks, Richardson and Spiegelhalter, 1996), there are at least two methods to choose the node to be fired at time $t$.

One method is to choose nodes in a sequential and cyclic manner, such as $0, \dots, n-1, 0, \dots, n-1, \dots$. In this case, the Markov chain is time-inhomogeneous because the transition matrix changes at every $t$, such as $W_0, \dots, W_{n-1}, W_0, \dots, W_{n-1}, \dots$. We call this Markov chain a sequential firing process. Although a sequential firing process is a time-inhomogeneous chain, if we observe it only every $n$ steps, at times $t = i, i+n, i+2n, \dots$, then the observed subsequence is a time-homogeneous chain with the transition matrix:

$$W_{i \to i-1} = W_i \cdots W_{n-1} W_0 \cdots W_{i-1}. \tag{5}$$

The other method is to choose a node at random. At every time $t$, a random number $i$ is drawn from the distribution $c(i)$ and node $i$ is fired (see footnote 2). We call this Markov chain a random firing process. A random firing process is a time-homogeneous chain, and its transition matrix is:

$$W = \sum_i c(i)\, W_i. \tag{6}$$

Let $D_N = \{x^1, \dots, x^N\}$ be a data set where $x^t$ denotes the value of $X$ at time $t$ in a firing process, and let $p_{D_N}$ be the empirical distribution of $D_N$. In the case of a random firing process,

$$\exists p^\infty, \forall p^0, \quad \lim_{t \to \infty} p^0 W^t = p^\infty \tag{7}$$

under the assumption of the ergodicity of $W$. Then, by the law of large numbers,

$$\lim_{N \to \infty} p_{D_N} = p^\infty \quad \text{a.s.} \tag{8}$$

2 If there are no special reasons, we use the uniform distribution for $c(i)$.

In the case of a sequential firing process,

$$\forall i, \exists p_i^\infty, \forall p^0, \quad \lim_{t \to \infty} p^0 W_{i \to i-1}^t = p_i^\infty \tag{9}$$

under the assumption of the ergodicity of $W_{i \to i-1}$. Let $p^\infty = \frac{1}{n} \sum_i p_i^\infty$; then we again get Eq. (8), because the data can be regarded as drawn with equal probability from $n$ time-homogeneous chains, whose state distributions converge to the respective $p_i^\infty$.

We define the model distribution $\pi'$ of the firing process as the limiting distribution $p^\infty$ in Eq. (8), i.e.:

$$\pi' = p^\infty. \tag{10}$$
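For very small networks the model distribution can be computed explicitly. The sketch below, written in plain Python under the assumption of binary variables, builds each $W_i$ as in Eq. (4), mixes them as in Eq. (6), and approximates the limiting distribution by repeated multiplication. The function names and the two conditional probability tables are hypothetical choices made for illustration.

```python
from itertools import product


def fire_matrix(i, sources, cpt, states):
    """Row-stochastic matrix W_i for firing node i (Eq. 4); binary variables assumed."""
    index = {s: k for k, s in enumerate(states)}
    W = [[0.0] * len(states) for _ in states]
    for s in states:
        y = tuple(s[j] for j in sources)
        for value, prob in enumerate(cpt[y]):
            t = s[:i] + (value,) + s[i + 1:]   # only X_i changes
            W[index[s]][index[t]] += prob
    return W


def mix(mats, c):
    """W = sum_i c(i) W_i (Eq. 6)."""
    m = len(mats[0])
    return [[sum(ci * M[a][b] for ci, M in zip(c, mats)) for b in range(m)]
            for a in range(m)]


def step(p, W):
    """One step of the chain: the row vector p times W."""
    return [sum(p[a] * W[a][b] for a in range(len(W))) for b in range(len(W))]


def limiting(W, steps=2000):
    """Approximate the limiting distribution, starting from the uniform distribution."""
    p = [1.0 / len(W)] * len(W)
    for _ in range(steps):
        p = step(p, W)
    return p


states = list(product((0, 1), repeat=2))
# Hypothetical tables: each node uses the other one as its information source.
W0 = fire_matrix(0, [1], {(0,): [0.8, 0.2], (1,): [0.3, 0.7]}, states)
W1 = fire_matrix(1, [0], {(0,): [0.6, 0.4], (1,): [0.1, 0.9]}, states)
W = mix([W0, W1], [0.5, 0.5])
model = limiting(W)   # plays the role of the model distribution
```

The two tables above need not be the conditionals of any single joint distribution; the chain still has a limiting distribution because every table entry is positive.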



Markov networks are a special subclass of the firing process network. Consider the following constraints.

Graph constraint: All edges in $G$ are bi-directed, i.e., if there is an edge from $X_i$ to $X_j$, then there is an edge from $X_j$ to $X_i$.

Parameter constraint: There exists $\pi' \in \mathcal{M}_G$ such that, for all $i$, $\theta_i(X_i \mid Y_i) = \pi'(X_i \mid Y_i)$.

If the sequential or the random firing process runs under these constraints, it is equivalent to Gibbs sampling, and the empirical distribution of the samples converges to $\pi'$. For a given Markov network, if we replace each of its edges $X_i - X_j$ with $X_i \to X_j$ and $X_j \to X_i$, then we have a firing process network that is equivalent to the given Markov network.



2.3 INFORMATION GEOMETRY OF A 
FIRING PROCESS NETWORK 

Information geometry (Amari and Nagaoka, 1993; Amari, 1995) illustrates important aspects of the firing process network. We define a conditional part manifold (see Appendix) as:

$$E(\theta_i) = \{\, p \mid p(X_i \mid X_{-i}) = \theta_i(X_i \mid Y_i) \,\}. \tag{11}$$

When a node $i$ is fired, the distribution of $X$ moves from $p(X)$ $(= p(X_{-i})\,p(X_i \mid X_{-i}))$ to $p(X_{-i})\,\theta_i(X_i \mid Y_i)$; in other words, the distribution of $X$ moves from $p$ to its m-projection onto $E(\theta_i)$.

In a sequential firing process, let $\pi'_i$ be the limiting distribution (= stationary distribution) of the corresponding chain in Eq. (5), or in a random firing process, let $\pi'_i = \pi' W_i$. Then, $\pi'_i$ is a distribution on $E(\theta_i)$, and $\pi'$ is a mixture of them. This implies that the model distribution of the firing process network is determined by $n$ manifolds $\{E(\theta_i)\}$, whereas in Markov networks it is determined by a single manifold such as $\mathcal{M}_G$. Further, note that each $\pi'_i$ has a rigorous Markov property $\pi'_i(X_i \mid X_{-i}) = \theta_i(X_i \mid Y_i)$; however, these rigorous Markov properties are lost in the model distribution $\pi'$.



3 LEARNING ALGORITHM FOR A 
FIRING PROCESS NETWORK 

3.1 PARAMETER-LEARNING 

Learning algorithms are usually designed by solving 
some optimization problem, i.e., to minimize or to 
maximize some score (e.g. likelihood). However, we 
take a different approach in this paper. Firstly, we 
determine a simple parameter-learning algorithm and 
show that this parameter-learning algorithm works ap- 
propriately under certain conditions. 



Our parameter-learning algorithm is simply:

$$\theta_i(X_i \mid Y_i) = \pi(X_i \mid Y_i). \tag{12}$$

Since $\pi(X_i \mid Y_i)$ is an empirical distribution of a data set, it is easily obtained by counting the samples in the data set.
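The counting step can be sketched in a few lines of Python. The function name `learn_cpt`, the row-tuple data layout, and the toy data set are assumptions made for illustration; configurations of $Y_i$ that never occur in the data simply get no entry in this sketch.

```python
from collections import Counter, defaultdict


def learn_cpt(data, i, sources):
    """Return theta_i(X_i | Y_i) = pi(X_i | Y_i) as {y: {x: probability}} (Eq. 12)."""
    counts = defaultdict(Counter)
    for row in data:
        y = tuple(row[j] for j in sources)
        counts[y][row[i]] += 1
    return {y: {x: c / sum(cnt.values()) for x, c in cnt.items()}
            for y, cnt in counts.items()}


# Toy data set: each row is (x_0, x_1).
data = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1)]
theta0 = learn_cpt(data, i=0, sources=[1])
```

Each node runs this computation on its own, using only the columns of $X_i$ and $Y_i$.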

In the firing process network, we compare two cases: $G$ is complete or not complete. In the case where $G$ is complete:

$$\theta_i(X_i \mid X_{-i}) = \pi(X_i \mid X_{-i}). \tag{13}$$

Thus, the firing process (sequential or random) is equivalent to Gibbs sampling and $\pi' = \pi$. In the case where $G$ is not complete:

$$\theta_i(X_i \mid Y_i) = \pi(X_i \mid Y_i). \tag{14}$$

We call this firing process incomplete Gibbs sampling.



Figure 3: Gibbs Sampling

Figure 4: Incomplete Gibbs Sampling



Figs. 3 and 4 illustrate the information geometry of Gibbs sampling and incomplete Gibbs sampling, respectively, with the sequential firing process (see footnote 3). As described in Section 2.3, each time a node $i$ fires, the distribution of $X$ moves to its m-projection onto $E(\theta_i)$. Figs. 3 and 4 give us some intuition:

• In Gibbs sampling, all $E(\theta_i)$ intersect at $\pi$. Thus, the distribution of $X$ converges to $\pi$.

• In incomplete Gibbs sampling, the $E(\theta_i)$ do not pass through $\pi$. Thus, the distribution of $X$ does not converge to $\pi$. However, if every $E(\theta_i)$ is close to $\pi$, then the distribution of $X$ hovers around $\pi$, and thus the model distribution $\pi'$ is close to $\pi$.



We provide more theoretical evidence for the second of these points. For theoretical simplicity, we treat only the random firing process in the rest of this paper.

We define a conditional part manifold $E_p^i$ and a marginal part manifold $M_p^i$:

$$E_p^i = \{\, q \mid q(X_i \mid X_{-i}) = p(X_i \mid X_{-i}) \,\} \tag{15}$$
$$M_p^i = \{\, q \mid q(X_{-i}) = p(X_{-i}) \,\} \tag{16}$$

and define the KL-divergence between a distribution $p$ and a manifold $S$:

$$KL(p \| S) = \min_{q \in S} KL(p \| q) = KL(p \| pP_m(S)) \tag{17}$$
$$KL(S \| p) = \min_{q \in S} KL(q \| p) = KL(pP_e(S) \| p). \tag{18}$$

Figure 5: $FCD(p \| q) = \sum_i c(i)\, KL(p \| E_q^i)$

Here, we define the following Bregman divergence (Censor and Zenios, 1997) (see Appendix), which we call the full-conditional divergence:

$$FCD(p \| q) = B_\psi(p \| q) = \sum_i c(i)\, KL(p \| E_q^i) = \sum_i c(i) \left\langle \log \frac{p(X_i \mid X_{-i})}{q(X_i \mid X_{-i})} \right\rangle_p \tag{19}$$

$$\psi(p) = \sum_i c(i) \left\langle \log p(X_i \mid X_{-i}) \right\rangle_p = -\sum_i c(i)\, H_p(X_i \mid X_{-i}) \tag{20}$$

where $\langle \cdot \rangle_p$ denotes the expectation under $p$ and $H_p(\cdot \mid \cdot)$ denotes a conditional entropy (Cover and Thomas, 1991). As KL-divergence is a Bregman divergence with potential $-H_p(X)$, FCD is also a Bregman divergence. Thus, we can use it as a pseudo-distance.
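To make the definition concrete, the following Python sketch evaluates the full-conditional divergence of Eq. (19) for joint distributions stored as dictionaries over state tuples. The helper names `conditional` and `fcd` and the two example distributions are illustrative assumptions; the sketch also assumes that $q$ is positive wherever $p$ is.

```python
import math
from itertools import product


def conditional(p, i, x):
    """p(x_i | x_{-i}) for a joint distribution stored as {state: probability}."""
    rest = x[:i] + x[i + 1:]
    denom = sum(prob for s, prob in p.items() if s[:i] + s[i + 1:] == rest)
    return p[x] / denom


def fcd(p, q, c):
    """FCD(p||q) = sum_i c(i) <log p(X_i|X_-i) / q(X_i|X_-i)>_p (Eq. 19)."""
    total = 0.0
    for i, ci in enumerate(c):
        for x, px in p.items():
            if px > 0:
                total += ci * px * math.log(conditional(p, i, x) / conditional(q, i, x))
    return total


states = list(product((0, 1), repeat=2))
p = dict(zip(states, [0.4, 0.1, 0.1, 0.4]))       # correlated pair
q = dict(zip(states, [0.25, 0.25, 0.25, 0.25]))   # uniform
c = [0.5, 0.5]                                    # uniform c(i)
```

In this example `fcd(p, p, c)` is zero and `fcd(p, q, c)` is positive, as expected of a pseudo-distance.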

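The following inequality implies that if every conditional part manifold $E(\theta_i)$ is close to $\pi$, then the model distribution $\pi'$ is also close to $\pi$.

Upper bound of FCD)

$$FCD(\pi \| \pi') \le \sum_i c(i)\, KL(\pi \| E(\theta_i)). \tag{21}$$

Proof) The transition matrix of the random firing process is

$$\sum_i c(i)\, P_m(E(\theta_i)). \tag{22}$$

The model distribution $\pi'$ is clearly equal to the limiting distribution (= stationary distribution) of this Markov chain. Thus:

$$\pi' = \pi' \sum_i c(i)\, P_m(E(\theta_i)). \tag{23}$$

Here, we define:

$$\pi'_i = \pi' P_m(E(\theta_i)). \tag{24}$$

Then:

$$\pi' = \sum_i c(i)\, \pi'_i. \tag{25}$$

Now, consider $KL(\pi \| \pi')$:

$$\begin{aligned}
KL(\pi \| \pi') &= \left\langle \log \frac{\pi}{\pi'} \right\rangle_\pi = \left\langle \log \pi - \log \sum_i c(i)\, \pi'_i \right\rangle_\pi \\
&= \left\langle \log \pi - \sum_i c(i) \log \pi'_i \right\rangle_\pi + \sum_i c(i) \left\langle \log \pi'_i \right\rangle_\pi - \left\langle \log \sum_i c(i)\, \pi'_i \right\rangle_\pi \\
&= \sum_i c(i) \left\langle \log \frac{\pi}{\pi'_i} \right\rangle_\pi - J
\end{aligned} \tag{26}$$

where

$$J = \left\langle \log \sum_i c(i)\, \pi'_i \right\rangle_\pi - \sum_i c(i) \left\langle \log \pi'_i \right\rangle_\pi. \tag{27}$$

Note that $J \ge 0$, by the convexity of $-\log$.

3 In the figures in this paper, dashed lines represent e-geodesics or e-flat manifolds.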



Figure 6: Information Geometry around $\pi$, $\pi'_i$ and $\pi'$

Fig. 6 illustrates the information-geometric relation between $\pi'$, $\pi'_i$ and $\pi$. By Pythagoras' theorem in information geometry (Amari and Nagaoka, 1993; Amari, 1995),

$$KL(\pi \| \pi') = KL(\pi \| E_{\pi'}^i) + KL(M_\pi^i \| \pi') \tag{28}$$
$$KL(\pi \| \pi'_i) = KL(\pi \| E(\theta_i)) + KL(M_\pi^i \| \pi'_i). \tag{29}$$

Note that firing node $i$ leaves the marginal distribution of $X_{-i}$ unchanged, so that:

$$KL(M_\pi^i \| \pi') = KL(M_\pi^i \| \pi'_i) = \left\langle \log \frac{\pi(X_{-i})}{\pi'(X_{-i})} \right\rangle_\pi. \tag{30}$$

Subtracting:

$$\sum_i c(i) \left\langle \log \frac{\pi(X_{-i})}{\pi'(X_{-i})} \right\rangle_\pi \tag{31}$$

from both sides of Eq. (26), we get

$$FCD(\pi \| \pi') = \sum_i c(i)\, KL(\pi \| E(\theta_i)) - J. \tag{32}$$

Since $J \ge 0$, we get the upper bound of the full-conditional divergence.

(End of Proof)

3.2 STRUCTURE-LEARNING

We have already determined the parameter-learning algorithm in the previous subsection; forming a good model therefore depends on the structure-learning algorithm, the role of which is to determine $\eta_i$ for each node $i$.

In any machine learning algorithm, we must consider two conflicting requirements for constructing a good model:

A. The model distribution $\pi'$ should be close to the data distribution $\pi$.

B. The complexity of the model should be low to avoid overfitting.

The previous subsection showed that we should place the conditional part manifold $E(\theta_i)$ close to $\pi$ to satisfy requirement A. Since

$$KL(\pi \| E(\theta_i)) = H_\pi(X_i \mid Y_i) - H_\pi(X_i \mid X_{-i}), \tag{33}$$

minimizing $KL(\pi \| E(\theta_i))$ is equivalent to minimizing $H_\pi(X_i \mid Y_i)$. Information theory (Cover and Thomas, 1991) states that if we add a new node to $Y_i$, then $H_\pi(X_i \mid Y_i)$ never increases. However, if we add a new node to $Y_i$, then the complexity of the model increases. As an extreme example, if we let $Y_i = X_{-i}$, then:

• Graph $G$ becomes the complete graph.

• The firing process becomes Gibbs sampling.

• $\pi' = \pi$; however, the model is overfitted.

We thus use the following information criteria to determine the trade-off between requirement A and requirement B.
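The entropy term behind requirement A can be computed directly from counts. The sketch below estimates the empirical conditional entropy $H_\pi(X_i \mid Y_i)$ in nats and shows, on a small made-up data set, that enlarging the information source does not increase it. The function name `cond_entropy` and the data are illustrative assumptions.

```python
import math
from collections import Counter


def cond_entropy(data, i, sources):
    """Empirical conditional entropy H_pi(X_i | Y_i) in nats."""
    n = len(data)
    joint = Counter((tuple(r[j] for j in sources), r[i]) for r in data)
    marg = Counter(tuple(r[j] for j in sources) for r in data)
    return -sum(c / n * math.log(c / marg[y]) for (y, _), c in joint.items())


# Made-up data set: each row is (x_0, x_1, x_2).
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1),
        (0, 1, 0), (1, 0, 1), (0, 0, 0), (1, 1, 1)]
h_none = cond_entropy(data, 0, [])       # H(X_0)
h_one = cond_entropy(data, 0, [1])       # H(X_0 | X_1)
h_two = cond_entropy(data, 0, [1, 2])    # H(X_0 | X_1, X_2)
```

Because the entropy can only shrink or stay the same, minimizing it alone would always favor the complete graph, which is why a complexity penalty is needed.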

3.2.1 Node-by-node MDL/AIC 

One method of evaluating the goodness of a model is to use some general information criterion, such as MDL (Minimum Description Length (Rissanen, 2007)) or AIC (Akaike Information Criterion (Akaike, 1974)). However, it is difficult to apply them directly to the whole system, and therefore we apply information criteria to each node.

If we treat the conditional part manifold $E(\theta_i)$ as a model manifold, then the maximum likelihood for a data set that has $N$ samples and an empirical distribution $\pi$ corresponds to:

$$N \times KL(\pi \| E(\theta_i)) = N\,(H_\pi(X_i \mid Y_i) - H_\pi(X_i \mid X_{-i})). \tag{34}$$

The second term of the right-hand side can be neglected, because it is a constant with respect to the choice of $Y_i$. Let $k_i$ be the number of free parameters in the conditional probability table $\theta_i(X_i \mid Y_i)$ of node $i$:

$$k_i = |Y_i|\,(|X_i| - 1) \tag{35}$$

where $|\ast|$ denotes the number of values that $\ast$ takes. We define:

$$nnMDL_i(\eta_i) = N H_\pi(X_i \mid Y_i) + \frac{k_i \log N}{2} \tag{36}$$

and call it node-by-node MDL. Similarly, we define:

$$nnAIC_i(\eta_i) = N H_\pi(X_i \mid Y_i) + k_i \tag{37}$$

and call it node-by-node AIC.

Which information criterion to use depends on what assumptions we make about the underlying real distribution that the data comes from. We use nnMDL in later sections.
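The two criteria can be sketched in Python as follows. The code assumes the usual MDL penalty of $(k_i/2) \log N$ and natural logarithms; the function names, the `cards` list holding the number of values of each variable, and the toy data are hypothetical.

```python
import math
from collections import Counter


def cond_entropy(data, i, sources):
    """Empirical conditional entropy H_pi(X_i | Y_i) in nats."""
    n = len(data)
    joint = Counter((tuple(r[j] for j in sources), r[i]) for r in data)
    marg = Counter(tuple(r[j] for j in sources) for r in data)
    return -sum(c / n * math.log(c / marg[y]) for (y, _), c in joint.items())


def num_params(cards, i, sources):
    """k_i = |Y_i| (|X_i| - 1), where cards[j] is the number of values of X_j."""
    return math.prod(cards[j] for j in sources) * (cards[i] - 1)


def nn_mdl(data, cards, i, sources):
    """Node-by-node MDL, assuming the penalty (k_i / 2) log N."""
    n = len(data)
    return n * cond_entropy(data, i, sources) + 0.5 * num_params(cards, i, sources) * math.log(n)


def nn_aic(data, cards, i, sources):
    """Node-by-node AIC: N * H_pi(X_i | Y_i) + k_i."""
    return len(data) * cond_entropy(data, i, sources) + num_params(cards, i, sources)


# Toy data in which x_0 always equals x_1.
data = [(0, 0)] * 4 + [(1, 1)] * 4
cards = [2, 2]
```

On this toy data both criteria prefer the information source {1} for node 0, because the entropy term drops to zero while the penalty grows only slightly.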

3.2.2 Selecting Information Source 

To find an information source $Y_i$ that minimizes $nnMDL_i(\eta_i)$, we must examine all combinations of variables in $X_{-i}$, which causes the computational costs to rise unacceptably. Therefore, we use the following greedy algorithm (written in pseudo-Java).

η_i = ∅;
while (true) {
    // try the single addition that lowers nnMDL the most
    j = arg min_j nnMDL(η_i + {j});
    if (nnMDL(η_i + {j}) < nnMDL(η_i)) {
        η_i = η_i + {j};
        continue;
    }
    // otherwise try the single removal that lowers nnMDL the most
    j = arg min_j nnMDL(η_i - {j});
    if (nnMDL(η_i - {j}) < nnMDL(η_i)) {
        η_i = η_i - {j};
        continue;
    }
    break;
}

This algorithm is similar to the forward-backward al- 
gorithms used in feature selection (Guyon and Elisse- 
eff, 2003). 
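A runnable counterpart of the greedy search might look like the Python sketch below. It assumes the nnMDL penalty $(k_i/2) \log N$; the function names, the `cards` list of variable cardinalities, and the synthetic data are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def cond_entropy(data, i, sources):
    """Empirical conditional entropy H_pi(X_i | Y_i) in nats."""
    n = len(data)
    joint = Counter((tuple(r[j] for j in sources), r[i]) for r in data)
    marg = Counter(tuple(r[j] for j in sources) for r in data)
    return -sum(c / n * math.log(c / marg[y]) for (y, _), c in joint.items())


def nn_mdl(data, cards, i, eta):
    """N * H_pi(X_i | Y_i) + (k_i / 2) log N, with k_i = |Y_i| (|X_i| - 1)."""
    sources = sorted(eta)
    k = math.prod(cards[j] for j in sources) * (cards[i] - 1)
    return len(data) * cond_entropy(data, i, sources) + 0.5 * k * math.log(len(data))


def select_sources(data, cards, i):
    """Greedy forward-backward search for the information source of node i."""
    eta = frozenset()
    score = nn_mdl(data, cards, i, eta)
    candidates = set(range(len(cards))) - {i}
    while True:
        # Forward step: best single addition.
        adds = [(nn_mdl(data, cards, i, eta | {j}), j) for j in sorted(candidates - eta)]
        if adds:
            best, j = min(adds)
            if best < score:
                eta, score = eta | {j}, best
                continue
        # Backward step: best single removal.
        drops = [(nn_mdl(data, cards, i, eta - {j}), j) for j in sorted(eta)]
        if drops:
            best, j = min(drops)
            if best < score:
                eta, score = eta - {j}, best
                continue
        return sorted(eta)


# Synthetic data: x_0 equals x_1, and x_2 is independent of both.
data = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 10
cards = [2, 2, 2]
```

Here `select_sources(data, cards, 0)` returns `[1]` and `select_sources(data, cards, 2)` returns `[]`. The loop terminates because the score strictly decreases at every accepted step.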

3.3 NUMBER OF DATA AND MODEL 

In this subsection, we describe the relation between the number of data $N$ and the model distribution $\pi'$. When the number of data is small, the second term on the right-hand side of Eq. (36) dominates nnMDL, and thus the number of information sources is suppressed.

Fig. 7 illustrates the relation between the number of data $N$ and the proposed model. In this figure, $\pi_{real}$ denotes the underlying real distribution that the data is drawn from, and dashed lines denote $E(\theta_i)$. Note that our goal is not to approximate $\pi$, but rather $\pi_{real}$. In the extreme case where $\eta_i = \emptyset$ for all nodes and $G$ has no edges, $\pi'$ is equal to the mean field approximation of $\pi$, and all $E(\theta_i)$ intersect at $\pi'$. As $N$ increases, all $E(\theta_i)$ move toward $\pi$, and $\pi'$ approaches $\pi$. In the case where $N \to \infty$, $\pi' = \pi = \pi_{real}$ and all $E(\theta_i)$ intersect at $\pi' = \pi = \pi_{real}$.

Figure 7: Number of Data and Proposed Model (columns: Graph $G$; Information geometry)

This behavior is reasonable because the model trusts $\pi$ when it is close to $\pi_{real}$.

3.4 COMPUTATIONAL COST 

The key to the proposed algorithms is that computation is performed independently by each node. This independence simplifies the situation.

Constructing a table of $\pi(X_i \mid Y_i)$ requires $O(N)$ computations, as it is formed by counting $N$ samples. In addition, evaluating $H_\pi(X_i \mid Y_i)$ requires $O(N)$.

Adding a node to an information source of another node requires $O(nN)$ computations, as it requires the evaluation of $H_\pi(X_i \mid Y_i)$ at most $n$ times. Similarly, subtracting a node from an information source of a node requires $O(nN)$.

Experiments show that subtracting nodes from information sources rarely occurs in structure-learning. Thus, approximately $|\eta_i|$ node additions occur during the structure-learning of node $i$. By Eq. (36) we get:

$$N H_\pi(X_i \mid Y_i) + \frac{k_i \log N}{2} \le nnMDL_i(\emptyset). \tag{38}$$

Thus, $k_i = O(N / \log N)$ and:

$$|\eta_i| = O(\log k_i) = O\!\left(\log \frac{N}{\log N}\right) = O(\log N). \tag{39}$$

Therefore, one node requires $O(nN \log N)$ and the whole system requires $O(n^2 N \log N)$ computations for structure-learning.

To compute $\pi'$ numerically, we must compute the eigenvector for eigenvalue 1 of the $|X| \times |X|$ transition matrix, which requires $O(|X|^3)$ computations. Therefore, it is intractable to compute $\pi'$ for large models.

4 SAMPLING FROM A MODEL 
DISTRIBUTION 

This section describes the use of our model after learning from data.

The model is used as a Markov chain Monte Carlo method (Gilks, Richardson and Spiegelhalter, 1996). We do not compute the model distribution $\pi'$ numerically, but rather draw samples from $\pi'$. This is performed by the firing process described in Section 2.2.

4.1 SAMPLING FROM A POSTERIOR 
DISTRIBUTION 

Here, we separate the variables in the network into two parts: $X = (X_{-f}, X_f)$. In Gibbs sampling, if we fix the value of $X_f$ to $x_f$ and only fire the nodes in $X_{-f}$, then we can draw samples from $\pi(X_{-f} \mid x_f)$. In this paper, we call this partial sampling. We can also conduct partial sampling in the proposed model.

We also separate $Y_i$ into two parts: $Y_i = (Y_{i-f}, Y_{if})$, where $Y_{i-f}$ denotes the variables included in both $Y_i$ and $X_{-f}$, and $Y_{if}$ denotes the variables included in both $Y_i$ and $X_f$.

Suppose we have already finished learning and obtained a network $FPN_A(G, \theta, f_p)$. Let $FPN_B$ be the following firing process network:

• Nodes in $X_f$ are removed from $FPN_A$.

• The conditional probability table of node $i$ is $\theta_i(X_i \mid Y_{i-f}, y_{if})$ in $FPN_B$, while it is $\theta_i(X_i \mid Y_i)$ in $FPN_A$. The values $y_{if}$ are fixed by $x_f$.

Then, it is clear that partial sampling on $FPN_A$ and the normal firing process on $FPN_B$ are equivalent. Let $\pi''(X_{-f})$ be the model distribution of $FPN_B$, and $E(\theta_{if})$ be the conditional part manifold of node $i$ in $FPN_B$, i.e.:

$$E(\theta_{if}) = \{\, p(X_{-f}) \mid p(X_i \mid X_{-i-f}) = \theta_i(X_i \mid Y_{i-f}, y_{if}) \,\} \tag{40}$$

where $X_{-i-f}$ denotes the variables in $X_{-f}$ except for $X_i$. Here, we can derive the following equation:

$$KL(\pi \| E(\theta_i)) = \sum_{x_f} \pi(x_f)\, KL(\pi(X_{-f} \mid x_f) \| E(\theta_{if})). \tag{41}$$

This equation shows that the average of $KL(\pi(X_{-f} \mid x_f) \| E(\theta_{if}))$ is equal to $KL(\pi \| E(\theta_i))$. Thus, if $KL(\pi \| E(\theta_i))$ is small, then $KL(\pi(X_{-f} \mid x_f) \| E(\theta_{if}))$ is, on average, small, and the distribution of the samples drawn by partial sampling converges to $\pi''(X_{-f})$, which is, on average, close to $\pi(X_{-f} \mid x_f)$.
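Partial sampling amounts to running the random firing process while never firing the fixed nodes. The following Python sketch shows this under simple assumptions: variables take values $0, 1, \dots$, tables are dictionaries keyed by the values of the information source, and $c(i)$ is uniform over the free nodes. The function name and the example tables are hypothetical.

```python
import random


def partial_sampling(cpts, sources, state, fixed, steps, rng):
    """Random firing process that only fires nodes outside `fixed`."""
    free = [i for i in range(len(state)) if i not in fixed]
    state = list(state)
    samples = []
    for _ in range(steps):
        i = rng.choice(free)                        # uniform c(i) over free nodes
        y = tuple(state[j] for j in sources[i])     # includes any fixed values
        probs = cpts[i][y]
        state[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        samples.append(tuple(state))
    return samples


rng = random.Random(0)
# Hypothetical two-node network; node 1 is observed and held at the value 1.
cpts = [{(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
        {(0,): [0.5, 0.5], (1,): [0.5, 0.5]}]
sources = [[1], [0]]
samples = partial_sampling(cpts, sources, state=[0, 1], fixed={1}, steps=5000, rng=rng)
freq = sum(s[0] for s in samples) / len(samples)
```

In this example the frequency of $x_0 = 1$ among the samples should be close to 0.8, the table entry for the fixed value of node 1.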

5 EXPERIMENTS 
5.1 3x3 ISING MODEL 

$$x_i = \pm 1, \qquad \pi_{real}(x) = \frac{1}{Z} \exp\Big(J \sum_{\langle i,j \rangle} x_i x_j\Big), \qquad J = 0.5$$

Figure 8: 3 x 3 Ising Model

$N = 100$: (.19, .22, .02) (.18, .19, .01) (.18, .20, .02) (.18, .20, .02)

Figure 9: Retrieved Structure for 3 x 3 Ising Model

We used the 3 x 3 Ising model shown in Fig. 8 for the first experiment. In this case, we could compute $\pi'$ numerically, as the size of the problem is small.

We used four data sets, each of which used a different random seed to draw i.i.d. (independent and identically distributed) samples from $\pi_{real}$, as shown in Fig. 9. In Fig. 9, the figures under the graphs are $(KL(\pi \| \pi'), KL(\pi \| \pi_{real}), KL(\pi' \| \pi_{real}))$. Note that in these results:

• $KL(\pi \| \pi') < KL(\pi \| \pi_{real})$; i.e., $\pi$ is closer to $\pi'$ than to $\pi_{real}$.

• $KL(\pi' \| \pi_{real}) < KL(\pi \| \pi_{real})$; i.e., $\pi_{real}$ is closer to $\pi'$ than to $\pi$.
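For a grid this small, the real distribution can be enumerated exactly. The Python sketch below builds $\pi_{real}$ for the 3 x 3 Ising model with $J = 0.5$ and defines a KL-divergence helper of the kind needed for the comparisons above. Free (non-periodic) boundaries are an assumption here, since the boundary condition is not stated; the function names are illustrative.

```python
import math
from itertools import product


def grid_edges(rows, cols):
    """Nearest-neighbour pairs on a grid with free boundaries (an assumption)."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            k = r * cols + c
            if c + 1 < cols:
                edges.append((k, k + 1))
            if r + 1 < rows:
                edges.append((k, k + cols))
    return edges


def ising(rows, cols, J):
    """Exact distribution proportional to exp(J * sum over edges of x_i x_j)."""
    edges = grid_edges(rows, cols)
    weights = {x: math.exp(J * sum(x[a] * x[b] for a, b in edges))
               for x in product((-1, 1), repeat=rows * cols)}
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}


def kl(p, q):
    """KL(p||q) for distributions stored as {state: probability}."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


pi_real = ising(3, 3, 0.5)
```

The enumeration covers $2^9 = 512$ states, which is why exact computation is feasible for this model but not for the 5 x 5 case.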

5.2 5x5 ISING MODEL 

In learning a multivariate distribution, we often encounter the situation that $N \ll |X|$. Therefore, we expanded the previous Ising model to 5 x 5, and similarly formed data sets by i.i.d. sampling with three different random seeds. In this case, $|X| = 2^{25}$ and it is intractable to compute $\pi'$ numerically because it would require $O(|X|^3)$ computations. However, we can observe how the model learns the structure.

$N = 1000$

Figure 10: Retrieved Structure for 5 x 5 Ising Model

Fig. 10 shows that the proposed model successfully retrieved the structure from the given data sets, and that retrieval depends on $N$ rather than $|X|$.

5.3 ONE-DAY MOVEMENT OF STOCK 
PRICES 

We give the following as an example of a real-world 
problem. 

For a one-day stock price, we define the binary variable $X_i$ as follows: $X_i = 1$ if opening-price < closing-price, and $X_i$ takes its other value if opening-price > closing-price. (42)

We followed 10 stocks and TOPIX (the overall index 
of stock prices on Tokyo Stock Exchange); thus, the 
vector X consists of 11 binary variables. We took 
N = 726 samples from the real market (2009/01/05- 
2011/12/28) and set the proposed model to learn the 
distribution. 

We do not know the real distribution that these sam- 
ples are drawn from. However we can observe the 
graph G constructed by the learning algorithm. 



Nodes: Astellas, Takeda (pharmacy companies); Nissan, Toyota, Honda (car manufacturers); Mitsui, Mitsubishi (general merchants); KDDI, DOCOMO, SoftBank (cellphone companies); TOPIX (index of whole market).

Figure 11: Learned Structure of one-day Stock Movement

In Fig. 11, every node takes other nodes in its sector or TOPIX as its information source, except for the edge between DOCOMO and Astellas.

If we remove TOPIX from the graph, the nodes separate into three groups: domestic industry (pharmacy, cellphones), exporting industry (car manufacturers) and importing industry (general merchants).



6 CONCLUSION 

The important difference between conventional graphical models (Markov networks, Bayesian networks) and firing process networks is as follows:

• In conventional graphical models, the structure determines a single manifold for the entire system, and the model distribution is located on this manifold.

• In firing process networks, each node has its own manifold, so the whole system has $n$ manifolds, and the model distribution is determined by these $n$ manifolds.

This difference makes the learning algorithms for firing process networks simple, since each node is only responsible for its own manifold and does not need to know what other nodes do during learning.

Future works on the proposed model will include: 

• Comparisons with conventional learning algo- 
rithms that work on conventional graphical mod- 
els. 

• Revised version of learning algorithms. 

• Expansion to continuous models. 

• Theory for sequential firing process.

Appendix

Assumption for convergence of Markov Chain 

In this paper, we often assumed the existence of a unique limiting distribution of a Markov chain. Here, we describe when this assumption is satisfied. We consider only time-homogeneous Markov chains with finite state spaces. The following theorem for Markov chains can be found in many textbooks.

Markov chain convergence theorem A)

There exists a unique limiting distribution $p^\infty$ for any initial distribution under the assumptions:

• The whole system is a communicating class.

• All states are aperiodic.

However, we cannot use this theorem directly in this paper, because the first assumption requires $\forall x,\ p^\infty(x) > 0$. For example, in the case that the number of data is smaller than the size of the range of $X$, the empirical distribution of the data never satisfies this assumption. Therefore, we use the following extended version.

Markov chain convergence theorem B)

There exists a unique limiting distribution $p^\infty$ for any initial distribution under the assumptions:

• The system has a unique closed communicating class.

• All states in the communicating class are aperiodic.

It is easy to extend theorem A to theorem B, because:

• The system enters the communicating class with probability 1.

• Once the system enters the communicating class, it never leaves.

Information Geometry of Joint Probability 

Let $X, Y$ be any stochastic variables, and $p(XY)$ be one of their joint probability distributions. By the product rule, $p(XY) = p(X)\,p(Y \mid X)$. Here, we call $p(X)$ the marginal part of $p$ and $p(Y \mid X)$ the conditional part of $p$. We also define two manifolds:

$$M_p = \{\, q \mid q(X) = p(X) \,\}$$
$$E_p = \{\, q \mid q(Y \mid X) = p(Y \mid X) \,\}.$$

We call $M_p$ the marginal part manifold of $p$ and $E_p$ the conditional part manifold of $p$. These manifolds have the following properties:

• $M_p \cap E_p = \{p\}$

• $M_p$ is m-flat. $E_p$ is m-flat and e-flat.

• $M_p \perp E_p$

Let $\mathcal{M}$ be any manifold, let $P_m(\mathcal{M})$ be the m-projection operator onto $\mathcal{M}$, i.e.,

$$qP_m(\mathcal{M}) = \arg\min_{r \in \mathcal{M}} KL(q \| r),$$

and let $P_e(\mathcal{M})$ be the e-projection operator onto $\mathcal{M}$, i.e.,

$$qP_e(\mathcal{M}) = \arg\min_{r \in \mathcal{M}} KL(r \| q).$$

Then:

• $P_m(E_p)$ replaces the conditional part: $qP_m(E_p) = q(X)\,p(Y \mid X)$.

• $P_m(M_p)$ replaces the marginal part: $qP_m(M_p) = p(X)\,q(Y \mid X)$.

• $P_m(E_p)$ is a linear operator.

• $P_e(M_p) = P_m(M_p)$.

All properties in this subsection are easily proved, but we skip their proofs to save space.

Bregman Divergence 

Let $p, q$ be any vectors in $R^N$ and $f$ be a continuously differentiable, real-valued, strictly convex function. The Bregman divergence is defined as:

$$B_f(p \| q) = f(p) - f(q) - \nabla f(q) \cdot (p - q)$$

where $\cdot$ denotes the inner product operator. We call $f$ the potential of $B_f$. The Bregman divergence has the following properties:

• $B_f(p \| q) \ge 0$, and $B_f(p \| q) = 0 \iff p = q$.

• For any linear function $l(p) = a \cdot p + b$, $B_{f+l} = B_f$.

• $\min_p f(p) = f(q) = 0 \implies B_f(p \| q) = f(p)$.
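As a quick numerical check of the definition, the Python sketch below evaluates the Bregman divergence with the negative entropy as potential and compares it with the KL-divergence between two normalized distributions. The function names and the example vectors are illustrative assumptions.

```python
import math


def bregman(f, grad_f, p, q):
    """B_f(p||q) = f(p) - f(q) - grad f(q) . (p - q)."""
    return f(p) - f(q) - sum(g * (a - b) for g, a, b in zip(grad_f(q), p, q))


def neg_entropy(p):
    """Potential f(p) = sum_x p(x) log p(x)."""
    return sum(a * math.log(a) for a in p)


def grad_neg_entropy(p):
    """Gradient of the negative entropy: log p(x) + 1."""
    return [math.log(a) + 1.0 for a in p]


def kl(p, q):
    """KL(p||q) for probability vectors with positive entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
b_pq = bregman(neg_entropy, grad_neg_entropy, p, q)
```

For normalized `p` and `q` the linear terms cancel, so `b_pq` agrees with `kl(p, q)` up to rounding.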



